ERIC User Please Note:



This summary discusses all 5 parts of Information Storage 
and Retrieval (ISR-18), which is available in its entirety as 
LI 002 719. Only the papers from Part Three are reproduced here 
as LI 002 722. See LI 002 720 for Part One, LI 002 721 for Part 
Two, LI 002 723 for Part Four and LI 002 724 for Part 5. 

Summary 

The present report is the eighteenth in a series describing research 
in automatic information storage and retrieval conducted by the Department 
of Computer Science at Cornell University. The report covering work carried 
out by the SMART project for approximately one year (summer 1969 to summer
1970) is separated into five parts: automatic content analysis (Sections
I to IV), automatic dictionary construction (Sections V to VII), user feed-
back procedures (Sections VIII to XI), document and query clustering methods
(Sections XII and XIII), and SMART systems design for on-line operations
(Sections XIV and XV).

Most recipients of SMART project reports will experience a gap in 
the series of scientific reports received to date. Report ISR-17, consisting 
of a master's thesis by Thomas Brauen entitled "Document Vector Modification
in On-line Information Retrieval Systems", was prepared for limited distribu-
tion during the fall of 1969. Report ISR-17 is available from the National
Technical Information Service in Springfield, Virginia 22151, under order 
number PB 186-13 r >. 

The SMART system continues to operate in a batch processing mode 
on the IBM 360 model 65 system at Cornell University. The standard processing
mode is eventually to be replaced by an on-line system using time-shared 
console devices for input and output. The overall design for such an on-line 
version of SMART has been completed, and is described in Section XIV of the 
present report. While awaiting the time-sharing implementation of the 
system, new retrieval experiments have been performed using larger document 
collections within the existing system. Attempts to compare the performance
of several collections of different sizes must take into account the
collection "generality". A study of this problem is made in Section II of
the present report. Of special interest may also be the new procedures
for the automatic recognition of "common" words in English texts (Section
VI), and the automatic construction of thesauruses and dictionaries for use
in an automatic language analysis system (Section VII). Finally, a new
inexpensive method of document classification and term grouping is
described and evaluated in Section XII of the present report.

Sections I to IV cover experiments in automatic content analysis 
and automatic indexing. Section I by S. F. Weiss contains the results of 
experiments, using statistical and syntactic procedures for the automatic 
recognition of phrases in written texts. It is shown once again that be-
cause of the relative heterogeneity of most document collections, and
the sparseness of the document space, phrases are not normally needed
for content identification.

In Section II by G. Salton, the "generality" problem is examined 
which arises when two or more distinct collections are compared in a 
retrieval environment. It is shown that proportionately fewer nonrelevant
items tend to be retrieved when larger collections (of low generality)
are used, than when small, high generality collections serve for evaluation
purposes. The systems viewpoint thus normally favors the larger, low
generality output, whereas the user viewpoint prefers the performance of
the smaller collection.

The effectiveness of bibliographic citations for content analysis
purposes is examined in Section III by G. Salton. It is shown that in
some situations when the citation space is reasonably dense, the use of
citations attached to documents is even more effective than the use of
standard keywords or descriptors. In any case, citations should be added
to the normal descriptors whenever they happen to be available.

In the last section of Part 1, certain template analysis methods
are applied to the automatic resolution of ambiguous constructions
(Section IV by S. F. Weiss) . It is shown that a set of contextual rules 
can be constructed by a semi-automatic learning process, which will eventually 
lead to an automatic recognition of over ninety percent of the existing 
textual ambiguities. 

Part 2, consisting of Sections V, VI and VII covers procedures 
for the automatic construction of dictionaries and thesauruses useful in 
text analysis systems. In Section V by D. Bergmark it is shown that word 
stem methods using large common word lists are more effective in an infor-
mation retrieval environment than some manually constructed thesauruses,
even though the latter also include synonym recognition facilities.

A new model for the automatic determination of "common" words
(which are not to be used for content identification) is proposed and
evaluated in Section VI by K. Bonwit and J. Aste-Tonsmann. The resulting
process can be incorporated into fully automatic dictionary construction
systems. The complete thesaurus construction problem is reviewed in Section
VII by G. Salton, and the effectiveness of a variety of automatic dictionaries
is evaluated.

Part 3, consisting of Sections VIII through XI, deals with a
number of refinements of the normal relevance feedback process which has
been examined in a number of previous reports in this series. In Section
VIII by T. P. Baker, a query splitting process is evaluated in which input
queries are split into two or more parts during feedback whenever the
relevant documents identified by the user are separated by one or more non-
relevant ones.

The effectiveness of relevance feedback techniques in an environ-
ment of variable generality is examined in Section IX by B. Capps and M.
Yin. It is shown that some of the feedback techniques are equally applica-
ble to collections of small and large generality. Techniques of negative
feedback (when no relevant items are identified by the users, but only
nonrelevant ones) are considered in Section X by M. Kerchner. It is shown
that a number of selective negative techniques, in which only certain
specific concepts are actually modified during the feedback process, bring
good improvements in retrieval effectiveness over the standard nonselective
methods.

Finally, a new feedback methodology in which a number of documents 
jointly identified as relevant to earlier queries are used as a set for 
relevance feedback purposes is proposed and evaluated in Section XI by L. 
Paavola. 

Two new clustering techniques are examined in Part 4 of this report
consisting of Sections XII and XIII. A controlled, inexpensive, single-pass
clustering algorithm is described and evaluated in Section XII by D. B.
Johnson and J. M. Lafuente. In this clustering method, each document is
examined only once, and the procedure is shown to be equivalent in certain
circumstances to other more demanding clustering procedures.

The query clustering process, in which query groups are used to
define the information search strategy, is studied in Section XIII by S.
Worona. A variety of parameter values is evaluated in a retrieval environ-
ment to be used for cluster generation, centroid definition, and final
search strategy.

The last part, number five, consisting of Sections XIV and XV, 
covers the design of on-line information retrieval systems. A new 
SMART system design for on-line use is proposed in Section XIV by D. and
R. Williamson, based on the concepts of pseudo-batching and the interaction
of a cycling program with a console monitor. The user interface and 
conversational facilities are also described. 

A template analysis technique is used in Section XV by S. F. Weiss 
for the implementation of conversational retrieval systems used in a time- 
sharing environment. The effectiveness of the method is discussed, as 
well as its implementation in a retrieval situation. 

Additional automatic content analysis and search procedures used 
with the SMART system are described in several previous reports in this 
series, including notably reports ISR-11 to ISR-16 published between 1966 
and 1969. These reports are all available from the National Technical 
Information Service in Springfield, Virginia.



G. Salton 
VIII. Variations on the Query Splitting Technique
with Relevance Feedback

T. P. Baker



Abstract

Some experiments in relevance feedback are performed with variations 
on the technique of query splitting. The results obtained indicate that these 
variations, as tested, offer no significant improvement over previously 
tried methods of query splitting. 




1. Introduction to Query Splitting 

In a document retrieval system with relevance feedback, query
splitting refers to the creation of multiple queries from a single previous
query, making use of user relevance judgments on documents retrieved by
that query in a previous search. The intention in generating these multiple
queries is to allow the search to be directed toward several individual
clusters of relevant documents, a necessary assumption being that these
clusters exist and do contain relevant documents which have not been pre-
viously retrieved.

There is little doubt that in a situation where several clusters 
of relevant documents are retrieved in the initial search it is desirable 
to generate multiple queries for succeeding iterations. The problem 
remaining is to distinguish this condition from those in which the relevant 
documents are unclustered or fall into a single cluster. 

Borodin, Kerr, and Lewis [1] propose one method. Their algorithm
makes use of the average interdocument correlation among the relevant docu-
ments available for feedback as a cutoff in determining whether a given pair
of documents should be split. The results obtained with this algorithm are
inconclusive, but indicate that it is not sufficiently selective.

Ide [2] suggests that a more sophisticated algorithm might look
for separation of relevant documents by nonrelevant documents within the docu-
ment space, splitting a pair of documents if and only if there exists a non-
relevant document more highly correlated with each of them than they are
with each other. In certain respects, this separation criterion is more
faithful to the conceptual basis of query splitting than the average corre-
lation criterion. Unlike the average correlation criterion, the separation
criterion takes into account the distribution of the nonrelevant documents.
This may be significant, since what is desired is the detection of clusters
of relevant documents. In contrast, what the average correlation criterion
does is to cluster relevant documents. Since nonrelevant documents are not
taken into account, this will not produce legitimate clusters, in terms of
the whole document space, when relevant documents locally outnumber nonrelevant
documents, or vice versa. For this reason it would seem that Ide's untested
separation criterion deserves more attention.

The usual concept of query splitting, as discussed by Borodin, Kerr, 
and Lewis and by Ide, is limited in application to cases where more than one 
relevant document is retrieved by a previous search iteration. It seems that 
if query splitting is of any value, something similar could be done for the 
queries which do not retrieve enough relevant documents to consider splitting 
in the usual sense. After all, these are generally the queries most in 
need of modification. What is needed is a dual to the usual formulation of 
query splitting — a technique of clustering nonrelevant documents for the 
generation of multiple queries through negative feedback. 
2. Algorithms for Query Splitting

Since the algorithm of Borodin, Kerr, and Lewis [1] using the average
correlation criterion has been shown to be largely ineffective on the SMART
document collections available, and because the separation criterion of Ide
[2] remains untried, the primary algorithm tested in this study makes use of
the separation criterion.

Since a pair splitting criterion does not by itself define a set of 
clusters, but rather an association matrix, a splitting algorithm may addi- 
tionally choose between the use of multilevel associations and the use of 
direct associations for generating clusters. An examination of the document 
and query collections used here immediately discloses that multilevel asso- 
ciation virtually eliminates cases of splitting in positive feedback. There- 
fore in order to facilitate experimentation, the splitting algorithm is 
weakened by permitting only directly connected pairs within clusters used 
for positive feedback. 

Adding to this constraint the requirement that all clusters be maximal, 
the two conditions are sufficient to define for any pair splitting criterion 
a unique set of clusters (not necessarily disjoint). 
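
The two conditions above (direct association only, maximal clusters) can be sketched in a few lines. The following is an illustrative sketch only, not the procedure actually used, which was carried out manually; the function names `key`, `is_separated` and `maximal_clusters` are hypothetical, and the correlations are those quoted for query Q250 in Example 1.

```python
from itertools import combinations


def key(a, b):
    """Order-independent lookup key for a document pair."""
    return frozenset((a, b))


def is_separated(a, b, nonrelevant, corr):
    """Separation criterion: some nonrelevant document correlates more
    highly with both a and b than a and b correlate with each other."""
    pair = corr[key(a, b)]
    return any(corr[key(a, x)] > pair and corr[key(b, x)] > pair
               for x in nonrelevant)


def maximal_clusters(relevant, nonrelevant, corr):
    """Maximal groups of relevant documents in which every pair is
    directly associated (not separated). Brute force over subsets, which
    is adequate for the handful of documents fed back per query."""
    linked = {key(a, b) for a, b in combinations(relevant, 2)
              if not is_separated(a, b, nonrelevant, corr)}
    cliques = [set(group)
               for size in range(1, len(relevant) + 1)
               for group in combinations(relevant, size)
               if all(key(a, b) in linked for a, b in combinations(group, 2))]
    # Keep only groups not strictly contained in a larger group.
    return [c for c in cliques if not any(c < other for other in cliques)]


# Correlations quoted for query Q250 (relevant 3, 115, 197; nonrelevant 7, 160).
Q250 = {
    key(3, 115): 0.5744, key(3, 197): 0.4700, key(3, 160): 0.4828,
    key(3, 7): 0.2208, key(115, 197): 0.7926, key(115, 160): 0.5797,
    key(115, 7): 0.3179, key(197, 160): 0.5506, key(197, 7): 0.3136,
}
clusters = maximal_clusters([3, 115, 197], [7, 160], Q250)
print(clusters)
```

Because the clusters are maximal sets of pairwise associated documents rather than connected components, they need not be disjoint: document 115 appears in both clusters here.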

The actual application of these clustering conditions for experimen-
tation with the ADI Abstracts-Thesaurus and Cranfield 200-Thesaurus collec-
tions is performed manually using document-document correlations computed by
the SMART system. To allow combining the results of the split queries in
a consistent fashion, the number of clusters generated for each query (in
cases where more would be generated) is limited to two by joining the pair
of documents which most nearly fails to pass the separation criterion. The
resulting pairs of clusters are fed to the SMART normalized relevance feed-
back facility in successive iterations.

The SMART relevance feedback formula used is:

$$Q' = MQ + \frac{M}{n}\sum_{i=1}^{n} R_i - \frac{M}{m}\sum_{i=1}^{m} N_i$$

where Q' is the new query; Q is the original query; M is an integer
constant; n is the number of relevant documents (R_i) fed back; m is the
number of nonrelevant documents (N_i) fed back.
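
As a rough illustration of the normalized feedback formula, the sketch below treats queries and documents as equal-length lists of term weights. It assumes that the relevant sum is added and the nonrelevant sum subtracted, as is usual for positive and negative feedback; the name `feedback_query` is hypothetical, and details of the SMART facility such as the handling of negative weights are not modelled.

```python
def feedback_query(query, relevant, nonrelevant, M=1):
    """Return Q' = M*Q + (M/n) * sum(R_i) - (M/m) * sum(N_i).

    query       -- list of term weights (Q)
    relevant    -- list of document vectors R_i (may be empty)
    nonrelevant -- list of document vectors N_i (may be empty)
    M           -- the integer constant of the formula
    """
    new_query = [M * weight for weight in query]
    for docs, sign in ((relevant, 1), (nonrelevant, -1)):
        if not docs:
            continue  # an empty set contributes nothing
        scale = sign * M / len(docs)
        for doc in docs:
            for term, weight in enumerate(doc):
                new_query[term] += scale * weight
    return new_query


# Toy vectors, invented purely for illustration.
print(feedback_query([1, 0, 2], [[1, 1, 0], [3, 1, 0]], [[0, 2, 2]], M=2))
```

Dividing by n and m keeps the total contribution of each document set independent of how many documents happen to be fed back.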

The top ranking seven documents according to the first "half" of
the split query are frozen in place, while the succeeding ranks are deter-
mined by another search iteration with the other "half" of the split query.
This is done with the two "halves" reversed, as well, so as to average out
the effects of order.
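
The rank-freezing step can be pictured as follows. This is a minimal sketch under the assumption that each search returns a ranked list of document numbers; the function name `merge_split_rankings` and the toy rankings are hypothetical.

```python
def merge_split_rankings(first_half, second_half, frozen=7):
    """Freeze the top `frozen` documents ranked by the first half of the
    split query, then fill the succeeding ranks from the second half,
    skipping documents that are already frozen."""
    head = list(first_half[:frozen])
    seen = set(head)
    tail = [doc for doc in second_half if doc not in seen]
    return head + tail


# Toy rankings, invented for illustration; two ranks frozen instead of seven.
rank_a = [1, 2, 3, 4]
rank_b = [3, 5, 1, 6]
forward = merge_split_rankings(rank_a, rank_b, frozen=2)
reverse = merge_split_rankings(rank_b, rank_a, frozen=2)
print(forward, reverse)
```

Evaluating both `forward` and `reverse` corresponds to running the two halves in both orders, which is how the text averages out the effect of order.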

The procedure described is applied to all queries retrieving more
than one relevant document in the top five ranks according to the first
search.

For those queries not retrieving sufficient relevant documents to 
be split for positive feedback, splitting in negative feedback is attempted. 

Where one relevant document is known, the dual to the separation 
criterion is tried, splitting pairs of nonrelevant documents that are more 
similar to the one relevant than they are to each other. The resulting 
clusters of nonrelevant documents are treated like the clusters of relevant 
documents above, with the single relevant document additionally being fed
back with each "half" of the split query. 
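
The dual criterion for a single known relevant document might be sketched as below. The figures are those quoted for query A13 (relevant document 37), reading the garbled entry for 37:12 as 0.3411, which is an assumption; the function name `split_nonrelevant_pairs` is hypothetical.

```python
from itertools import combinations


def split_nonrelevant_pairs(relevant_doc, nonrelevant, corr):
    """Return the pairs of nonrelevant documents that are each more
    highly correlated with the one relevant document than with each
    other (the dual of the separation criterion)."""
    separated = set()
    for a, b in combinations(nonrelevant, 2):
        pair = corr[frozenset((a, b))]
        if (corr[frozenset((relevant_doc, a))] > pair
                and corr[frozenset((relevant_doc, b))] > pair):
            separated.add(frozenset((a, b)))
    return separated


# Correlations quoted for query A13; 37 is relevant, the rest nonrelevant.
A13 = {frozenset(pair): value for pair, value in [
    ((37, 12), 0.3411), ((37, 21), 0.3059), ((37, 39), 0.3225),
    ((37, 60), 0.4000), ((12, 21), 0.3800), ((12, 39), 0.3769),
    ((12, 60), 0.3412), ((21, 39), 0.1741), ((21, 60), 0.5066),
    ((39, 60), 0.1061),
]}
separated = split_nonrelevant_pairs(37, [12, 21, 39, 60], A13)
print(sorted(sorted(pair) for pair in separated))
```

Note how narrowly the pair 12:60 escapes separation (0.3411 against 0.3412), which illustrates why a margin on the criterion is later suggested as a way of restricting undesired splits.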

Where no relevant documents are known, nonrelevant documents are
separated by correlation less than the average correlation between documents 
Sample separation criterion in its weak form applied to query
Q250 of the CRN2TH collection, which retrieved three relevant
documents out of five on the first search. The relevant docu-
ments retrieved are 3, 115, and 197. The two nonrelevant are
7 and 160.

The interdocument correlation matrix is (in part):

3:115   0.5744    3:197   0.4700    3:160   0.4828    3:7   0.2208
                  115:197 0.7926    115:160 0.5797    115:7 0.3179
                                    197:160 0.5506    197:7 0.3136

The pair of relevant documents which must be split for feedback
purposes because they are separated by a nonrelevant document is
3:197, which is split by 160.

The remaining associations are 3:115 and 115:197,

and the two derived clusters are 3-115 and 115-197.

Separation Criteria for Query Q250
Example 1
Sample separation criterion applied to query B04 of the ADIABTH
collection, which retrieves two relevant documents out of the top
five ranks according to the first search. The two relevant docu-
ments retrieved are 33 and 20. The nonrelevant documents are 5,
46, and 62.

The interdocument correlations are:

33:20 0.1097    33:5 0.4843    33:46 0.2000    33:62 0.2026
                20:5 0.2292    20:46 0.1073    20:62 0.0593

Although it might be interesting to split the nonrelevant documents,
there are relevant ones here to split, and the nonrelevant ones
are therefore used only to split relevant pairs. We see that the
pair 33:20 is split by 5, since 0.4843 and 0.2292 are both greater
than 0.1097. Thus 33 and 20 are separated for feedback purposes,
and since they are the only relevant documents available they
are the two clusters which will be used.

Separation Criterion for Query B04
Example 2
Application of the weak separation criterion to query A13 of the
ADIABTH collection, which retrieves one relevant document in the
five top-ranked documents according to the first search; the
relevant document is 37 and the nonrelevant are 12, 21, 39, and
60.

The interdocument correlation matrix is:

37:12 0.3411    37:21 0.3059    37:39 0.3225    37:60 0.4000
                12:21 0.3800    12:39 0.3769    12:60 0.3412
                                21:39 0.1741    21:60 0.5066
                                                39:60 0.1061

The following pairs of documents are more highly correlated
with 37 than they are with each other, and therefore are
separated: 39:60; 21:39.

The remaining associations may be summarized:
12:21, 12:39, 12:60, and 21:60.

Thus the resulting clusters of nonrelevant documents are:
12-21-60 and 12-39.

Separation Criterion for Query A13
Example 3
Application of the weak separation criterion to query Q189 of the
CRN2TH collection, which retrieves one relevant out of five top-
ranking documents according to the first search. The relevant document
is 148 and the nonrelevant are 6, 33, 144, and 169.

The interdocument correlation matrix is:

148:6 0.1782    148:33 0.4881    148:144 0.6491    148:169 0.1816
                6:33   0.1630    6:144   0.1347-   6:169   0.2686
                                 33:144  0.5682    33:169  0.1218-
                                                   144:169 0.0783

The following pairs of documents are more highly correlated with
148 than they are with each other, and therefore are separated for
feedback purposes: 144:169; 33:169; 6:144; 6:33.

The remaining associations may be summarized:
6:169 and 33:144.

The clusters of nonrelevant documents used for feedback are then:
6-169 and 33-144.

Separation Criterion for Query Q189
Example 4
Application of the average correlation criterion to query Q182
of the CRN2TH collection, which retrieved no relevant documents
in the top five ranks on the initial search:

Document-document correlations for the nonrelevant documents are:

39:112 .5367    39:164  .0100-    39:167  .0100-    39:179  .5696
                112:164 .1142     112:167 .1358     112:179 .6980
                                  164:167 .7212     164:179 .2487
                                                    167:179 .1660

The average correlation is 0.3190.

Thus the only associations permitted are 39:112, 39:179, 112:179,
and 164:167, giving 39-112-179 and 164-167,

which are the resulting clusters.

Correlation Criterion for Query Q182
Example 5
Application of the average correlation criterion to query Q266 of
the CRN2TH collection, which retrieved no relevant documents on
the initial search.

The nonrelevant documents known are 58, 162, 163, 164, and 165.
The interdocument correlations are:

58:162 .3932    58:163  .3368    58:164  .3662    58:165  .4113
                162:163 .3240    162:164 .5679    162:165 .5745
                                 163:164 .5194    163:165 .3744
                                                  164:165 .4585

The average is 0.2819.

Thus the only permissible associations are 162:164,
164:165, 164:163, 162:165.

Thus the clusters are:
162-164-165, 163-164, and 58.

Correlation Criterion for Query Q266
Example 6
retrieved, and clusters are formed using multilevel associations (direct
associations generally failing to produce any grouping at all). The 
clusters so formed are fed back in a manner similar to the clusters 
derived by the other two methods. 
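
The average correlation criterion with multilevel associations amounts to taking connected components of the graph whose edges are the pairs correlated at or above the average. The sketch below is illustrative only; the name `average_correlation_clusters` is hypothetical, and the two entries printed as ".0100-" for query Q182 are assumed to be negative values.

```python
from itertools import combinations


def average_correlation_clusters(docs, corr):
    """Group documents by multilevel association: two documents share a
    cluster if a chain of pairs, each correlated at or above the average
    pairwise correlation, connects them."""
    pairs = list(combinations(docs, 2))
    average = sum(corr[frozenset(p)] for p in pairs) / len(pairs)

    parent = {doc: doc for doc in docs}

    def find(doc):
        # Follow parent links up to the representative of the group.
        while parent[doc] != doc:
            doc = parent[doc]
        return doc

    for a, b in pairs:
        if corr[frozenset((a, b))] >= average:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for doc in docs:
        groups.setdefault(find(doc), []).append(doc)
    return list(groups.values())


# Correlations quoted for query Q182 (all five documents nonrelevant).
Q182 = {frozenset(pair): value for pair, value in [
    ((39, 112), 0.5367), ((39, 164), -0.0100), ((39, 167), -0.0100),
    ((39, 179), 0.5696), ((112, 164), 0.1142), ((112, 167), 0.1358),
    ((112, 179), 0.6980), ((164, 167), 0.7212), ((164, 179), 0.2487),
    ((167, 179), 0.1660),
]}
groups = average_correlation_clusters([39, 112, 164, 167, 179], Q182)
print(groups)
```

Unlike the maximal clusters used for positive feedback, the groups produced this way are always disjoint.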



Results obtained with these three algorithms on the ADI Abstracts- 



Thesaurus and Cranfield 200-Thesaurus collections are summarized in the 
following section. 

3. Results of Experimental Runs 

The tables on the following pages summarize the results of runs
made in the SMART system with splittable queries of the three categories
mentioned in the preceding section for the ADI Abstracts-Thesaurus (82
documents and 35 queries, denoted by ADIABTH) and Cranfield 200 (200
documents and 42 queries, denoted by CRN2TH) collections.

The following conventions apply:

-    indicates that the results of the split query and control
     runs are indistinguishable in terms of the number of
     relevant documents retrieved.

*    indicates that all relevant documents are retrieved
     and no improvement is possible.

0    indicates that neither run retrieved any relevant
     documents.

'    indicates that this query would also have split according
     to the stronger version of the splitting requirement.

@    indicates a keypunching error detected too late to correct
     in one of the feedback document specifications for the
     trial run.
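
The numeric entries in the tables are differences in the number of relevant documents retrieved up to ranks 5, 10, 15 and 20. A minimal sketch of that computation, assuming ranked lists of document numbers and a known set of relevant documents (the name `improvement_by_rank` and the toy data are hypothetical):

```python
def improvement_by_rank(split_ranking, control_ranking, relevant,
                        cutoffs=(5, 10, 15, 20)):
    """For each rank cutoff, count the relevant documents retrieved by
    the split query run minus those retrieved by the control run."""
    relevant = set(relevant)

    def hits(ranking, cutoff):
        return sum(1 for doc in ranking[:cutoff] if doc in relevant)

    return {cutoff: hits(split_ranking, cutoff) - hits(control_ranking, cutoff)
            for cutoff in cutoffs}


# Toy rankings, invented for illustration, with short cutoffs.
result = improvement_by_rank([1, 2, 3, 4, 5, 6], [6, 5, 4, 3, 2, 1],
                             relevant={1, 2}, cutoffs=(2, 4, 6))
print(result)
```

A positive entry favors the split query, a negative entry favors the control run, and the averages at the foot of each table are the column means of these differences.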
Queries from ADIABTH collection retrieving more than one relevant
document on the first search, and splittable by the weak separation
criterion:

Improvement of split queries over ordinary
normalized positive feedback in terms of
relevant documents retrieved up to rank:

Query          5       10      15      20

A03/a'         -       -       *       *
A03/b'         -       -       *       *
A15/a'        -1       -       -      -1
A15/b'        -1       -       1       -
B04/a'         -       1       -       -
B04/b'         1       1       -       -

Average:     -0.5    0.33    0.17   -0.17

Query Splitting Results for ADIABTH Collection
(FOSNEG)

Table 1
Queries from the CRN2TH collection retrieving more than
one relevant document on the first search, and splittable by
the weak separation criterion:

Improvement of split queries over
ordinary normalized positive feedback
in terms of relevant documents retrieved
up to rank:

   Query       5       10      15      20

   Q122/a'     -       1       -       -
   Q122/b'    -1       1       -       -
   Q148/a'    -2       -       -       -
   Q148/b'    -2       -       -       -
@  Q250/a      -       -       -       -
@  Q250/b      -       -       -       -
   Q268/a      -       -       *       *
   Q268/b      -       -       *       *
   Q269/a'     -       -       *       *
   Q269/b'             -       *       *

   Average:  -4/10    2/10

Query Splitting Results for CRN2TH Collection
(SPLPOS)

Table 2
Queries from the CRN2TH collection retrieving more than one
relevant document on the first search, and splittable by the
weak separation criterion:

Improvement of split queries over ordinary
normalized positive and negative feedback
in terms of relevant documents retrieved
up to rank:

   Query       5       10      15      20

   Q122/a'     -       1      -1       -
   Q122/b'    -1       1      -1       -
   Q148/a'    -1       -       -       -
   Q148/b'    -2       -       -       -
@  Q250/a'    -1      -1      -1      -1
@  Q250/b'    -1      -1      -1      -1
   Q268/a      -       -       *       *
   Q268/b      -       -       *       *
   Q269/a'    -1       -       *       *
   Q269/b'    -1               *       *

   Average:    0       0     -4/10   -1/10

Note: This comparison is unfair to the split query run,
since it made no use of negative feedback information.

Query Splitting Results for CRN2TH Collection
(SPLPOS)

Table 3



Queries from the CRN2TH collection retrieving more than one
relevant document out of five retrieved on the first search
and splittable by the weak separation criterion:

Improvement of split queries over
ordinary normalized positive and nega-
tive feedback in terms of relevant
documents retrieved up to rank:

   Query       5       10      15      20

   Q122/a'     -       -      -2      -1
   Q122/b'     -       1      -1      -1
   Q148/a'     -       -       -       -
   Q148/b'     -       -       -       -
@  Q250/a'     -      -1      -1      -1
@  Q250/b'     -      -1      -1      -1
   Q268/a      -       -       *       *
   Q268/b      -       -
   Q269/a'     -       -       *       *
   Q269/b      -       -

   Average:          -1/10   -5/10   -4/10

Note: Unlike the other tests, this run was done with the first
five documents retrieved by the initial search frozen in
their rank positions. It is also unfair to the split
query run, since the control made use of negative feedback.

Query Splitting Results for CRN2TH Collection
(SPLPOS)

Table 4
Queries from ADIABTH collection retrieving no relevant documents
on the first search, and splittable by correlation less than
average:

Improvement of split queries over ordinary
normalized negative feedback in relevant
documents retrieved up to rank:

Query    5

A08/a    0
A08/b   -1
A09/a    0
A09/b    1
B11/a    0
B11/b    0
B13/a    0
B13/b    0
B15/a    0
B15/b    0



10 15 20 



0 

0 -1 

1 

0 -1 -3 

0 -1 -2 



Average : 



0 -1/10 -1/5 -1/2 



Query Splitting Results for ADIABTH Collection 
(ALLNEG)




Table 5 
Queries from the CRN2TH collection retrieving no relevant
documents in the top five ranks for the first search, and
splittable by correlation below average.

Improvement of split queries over ordinary
normalized negative feedback with only the
top-ranked nonrelevant document used (as
opposed to the previous run which used
all five nonrelevant available) in terms
of relevant documents retrieved up to
rank:

Query          5       10      15      20

Q079/a        -1      -1      -1       -
Q079/b        -1      -1      -1       -
Q126/a         0       1       *
Q126/b         0       1               *
Q132/a         -       -       1       -
Q132/b        -1               -      -1
Q182/a         -       -       -       -
Q182/b        -1       -       -       -
Q266/a         0               3       2
Q266/b         0       2       4       3
Q323/a         -       -       -       -
Q323/b         -       -       -       -

Average:     -4/12    2/12    6/12    4/12

Query Splitting Results for CRN2TH Collection
(NORELS)

Table 6
Queries from the CRN2TH collection retrieving no relevant
documents in the first five ranks for the first search, and 
splittable by correlation below average. 

Improvement of split queries over ordinary 
normalized negative feedback in terms of 
relevant documents retrieved up to rank: 



Query 



5 10 15 



20 



Q079/a 

Q079/b 

Q126/a

Q126/b 

Q132/a 

Q132/b

Q182/a 

Q182/b 

Q266/a 

Q266/b 

Q323/a 

Q323/b 



0 0 0 

0 0 0 

0 - * 

0 - * 

-1 

-1 -1 -1 

-1 

0 1 - 

0 2 1 



-1 



1 



Average : 



- 2/12 1/12 0 



Query Splitting Results for CRN2TH Collection 
(NORELS) 




Table 7 
Queries from the ADIABTH collection retrieving one relevant
document in the top-ranking five on the first search, and 
splittable by weak separation criterion for nonrelevant 
documents. 



Query 

@ A01 * 

@ A02 * 
A04 * 
A06 * 
A07 
A10 
All 
A12 * 
A13 
A14 * 
A17 
B16 * 



Improvement of split queries over ordinary 
normalized positive and negative feedback 
in terms of relevant documents retrieved 
up to rank: 



5 10 



-1 

-1 



1 

1 



15 



-1 

-1 



ft 



-1 



20 



-2 



1 



-1 



1 



Average : 



-3/24 2/24 -3/24 -2/24 



Query Splitting Results for ADIABTH Collection
(SPLNEG)



Table 8
Queries from the CRN2TH collection retrieving only one
relevant document in the top five ranks on the first search,
and splittable by the weak separation criterion for nonrele-
vant documents.

Improvement of split queries over ordinary
normalized positive and negative feedback
in terms of relevant documents retrieved
up to rank:

Query          5       10      15      20

Q123/a         -       -       -       -
Q123/b         -       -       -       -
Q130/a         -      -1       *       *
Q130/b         -      -1      -1      -1
Q141/a         *       *       *
Q141/b         *       *       *       *
Q170/a         -               -       -
Q170/b         -       -       -       -
Q189/a'        *       *       *       *
Q189/b'        *       *       *       *
Q272/a        -1       -       *       *
Q272/b         -       -      -1       *

Average:     -1/12   -2/12   -2/12   -1/12

Query Splitting Results for CRN2TH Collection
(ONEREL)

Table 9
Queries from the CRN2TH collection retrieving only one relevant
document in the top five ranks on the first search, and split-
table by the weak separation criterion for nonrelevant documents.

Improvement of split queries over ordinary
normalized positive feedback in terms
of relevant documents retrieved up to
rank:

Query          5       10      15      20

Q123/a         -       -       -       -
Q123/b         -       -       -       -
Q130/a         -       -       1       -
Q130/b         -       -       -      -1
Q141/a         *       *       *       *
Q141/b                 *       *       *
Q170/a         -       -       -       -
Q170/b         -       -       -       -
Q189/a'        *       *       *       *
Q189/b'        *       *       *       *
Q272/a         -      -1       *       *
Q272/b         1      -1      -1       *

Average:      1/12   -2/12     0     -1/12

Query Splitting Results for CRN2TH Collection
(ONEREL)

Table 10
Correlations

    0         1         2         3

 0.4523    0.6175    0.8241    0.2811
 0.3^80    0.5598    0.4165    0.1603
 0.3665    0.3343    0.4137    0.3070
 0.3647    0.3325    0.3680    0.5045
 0.3638    0.2731    0.3518    0.4064
 0.3467    0.2363    0.3412    0.8449
 0.3333    0.2347    0.3246    0.1727
 0.3283    0.2334    0.2994    0.4017
 0.3119    0.2206    0.2994    0.3635
 0.3086    0.2141    0.2946    0.3530
 0.3000    0.2130    0.2930    0.3529
 0.3000    0.2092    0.2908    0.3390
 0.2949    0.2033    0.2768    0.3360
 0.2949    0.2001    0.2758    0.3356
 0.2917    0.1763    0.2668    0.3350



0.2673 

0.2080 

0.1803 

0.1793 


0.1606 

0.1547 

0.1418 


0.2488 

0.2482 

0.2415 


0.2109 

0.1705 

0.1607 

         Doc. Corr  Cent. Corr  Drop Doc  Corr. Rank  Old Doc  Old Reldoc  New Doc

Run 0       82          0          17         65         0         0          0
Run 1       82          0          31         51         0         0          5
Run 2       82          0          12         70         5         2          0
Run 3       75          0          14         61         5         2          7

0  initial search
1  control run with positive and negative feedback
2  first "half" of split query
3  second "half" of split query

Sample Output for Query 104

Fig. 1
4. Evaluation

In general, the results for query splitting in positive feedback 
with the separation criterion are comparable to those achieved by Borodin, 

Kerr and Lewis in their experiments with the average correlation criterion. 
Although a slight improvement may be noted for split queries over ordinary 
feedback, it is net predictable enough to justify the use of query split- 
ting in a working retrieval system. 

Only if a more selective method can be devised for determining which 
queries will benefit from splitting will the technique become of practical
value. Merely strengthening the splitting requirement by permitting multi- 
level associations in cluster formation appears to be of some value in 
eliminating nonproductive splitting in the queries tested. All queries 
split by the weaker method which show an improvement under splitting would 
be split in like manner by the stronger method. Strengthening the separa- 
tion as well, by providing that pairs be separated only if they exceed the 
requirements by some margin, may also be of value in restricting the number 
of undesired splits. 

For negative feedback, the situation is worse. The only run in
which splitting exhibited any improvement over the usual negative feedback
was on queries in the Cranfield collection retrieving no relevant documents
in the first search. Even there, the improvement was erratic. This failure
of splitting applied to negative feedback is not entirely surprising, since
the hypothesis of separate clusters of relevant documents used to justify
splitting in positive feedback does not apply. Here the best justification
for splitting is that, since the locations of no relevant documents are known,
multiple queries may offer more chance of success by means of a "shotgun"
effect, scattering the search over a larger area of the document space.
Altogether, the results of the negative feedback runs indicate that the
different "halves" of the split queries do not usually retrieve significantly
different portions of the document space. Thus it would seem that
this "shotgun" effect is not taking place. It may be that results can be
improved by weighting the nonrelevant documents more heavily in feedback. 
In any case, negative query splitting as tested does not appear to benefit 
from this effect sufficiently to justify the effort of multiple query 
generation.* 



Although the results of these experiments are largely negative,
it is important in viewing them to consider that the queries tested were
written by experts in their fields and are therefore generally consistent,
thus making the probability of success in query splitting rather low.
Also, being small, the document collections used are inimical to the
existence of multiple clusters of relevant documents. Relevant documents in
such small collections tend to fall into single clusters, or none. Although
the success of query splitting in these adverse circumstances would be a
strong argument in its favor, its failure in the same circumstances is less
conclusive. It would appear that if truly significant results are to be
achieved with query splitting, they will be achieved in the environment of
a larger, more diverse document collection and with more realistically
inconsistent queries.

* The only exception to this is query Q266 of CRN2TH, which showed remarkable
improvement on splitting.

IX. Effectiveness of Feedback Strategies
on Collections of Differing Generality

B. Capps and M. Yin



Abstract 

This study evaluates the comparative effectiveness of
several feedback strategies on collections which differ in
generality, namely the Cranfield 200 and Cranfield 400 collections.
A new query set which produces a constant number of
relevant documents over the two collections is used to regulate
the generality. The results are assessed from both the user
and the system viewpoint; some strategies do appear equally
effective on both collections.

1. Introduction

The ultimate goal of automatic information retrieval 
systems is to obtain a performance in "real life" situations
equally as good as or better than in manual systems under 
operational conditions. Experiments done on automatic systems 
such as SMART are performed on controlled and limited collec- 
tions. Therefore, in order to predict how the system will 
perform in a library situation, experiments on collections of 
different sizes are done and the results compared to see if 
there is a significant loss in performance as larger collec- 
tions are used. 

Generality is the proportion of relevant documents in a 
collection to total number of documents. In collections of 



varying sizes, generality is expected to differ, because the 
number of relevant documents does not increase proportionally 
to the number of nonrelevant documents. Therefore, results from 
test collections of different generality can be viewed as an 
indication of how the results from a test environment would be 
reflected in a real life situation. 

This study is concerned with the relevance feedback 
aspect of information retrieval. Relevance feedback is one 
of the ways to utilize user opinion in improving search effectiveness [1].
A set of documents is given to the user, who judges
which documents are relevant to his request. This information 
is then used to modify his original query for another search 
through the collection. The rationale is that the original 
query might be badly worded, so that the incorporation of 
concepts from documents judged relevant might retrieve other
related documents. 

The method used in this study is to run several search
strategies on collections of different generality and then to
compare the retrieval performances. Several means are available
to measure retrieval performance depending on the viewpoint
taken. The recall-precision graph is used to represent the user
viewpoint of how well the system is satisfying his needs.
However, this is not adequate to measure system efficiency;
consequently, fallout and adjusted precision have been developed.
Fallout is the proportion of nonrelevant documents retrieved
over total number of nonrelevant documents in the collection.
When plotted against recall, this takes into account how much
work the system has to do to retrieve equivalent numbers of
relevant documents. When fallout is constant, precision can be 
adjusted to take generality into account so that the precision 
from collections of different generality can be compared on an 
equal basis [3]. 
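
As an illustration of the measures just described, the following sketch computes recall, precision and fallout from raw counts at a single cutoff. The function name and argument names are illustrative assumptions and are not part of the SMART system.

```python
def retrieval_measures(rel_retrieved, nonrel_retrieved, total_rel, total_docs):
    """Return (recall, precision, fallout) for one query at one cutoff.

    All four arguments are plain counts; the names are hypothetical.
    """
    total_nonrel = total_docs - total_rel
    retrieved = rel_retrieved + nonrel_retrieved
    # Recall: share of the relevant documents that were retrieved.
    recall = rel_retrieved / total_rel
    # Precision: share of the retrieved documents that are relevant.
    precision = rel_retrieved / retrieved if retrieved else 0.0
    # Fallout: share of the nonrelevant documents that were retrieved.
    fallout = nonrel_retrieved / total_nonrel
    return recall, precision, fallout


if __name__ == "__main__":
    # 100 documents, 10 relevant; 20 retrieved, of which 5 are relevant.
    print(retrieval_measures(5, 15, 10, 100))
```

Fallout depends on the size of the nonrelevant pool, which is why it reflects the work done by the system rather than the user's view of the output.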

2. Experimental Environment 

The test collections should be similar in all respects 
except for generality. Ide [2] cites four factors which might
account for the differences in results of the two collections
she used, Cran 200 and ADI:

a) difference in subject matter 

b) difference in collection scope 

c) difference in variability within collection 

d) difference in query construction and relevance
judgment.

The CRN2NUL and CRN4NUL collections seem to eliminate these
factors since they are subcollections of a homogeneous set,
the Cranfield 1400, and are not mutually exclusive subcollections.

To vary the generality, the number of relevant items is 
held constant while the number of nonrelevant items varies. 

This can be done by creating a new query collection from the
original CRN2NUL QUESTS and CRN4NUL QUESTS collections. The
selected queries have the same relevance decisions in both the
Cran 200 and Cran 400 collections. There are twenty-two such
queries with a total of one hundred and fifteen relevant
documents. The formula for generality, averaged over all
queries, is [4]:

$$ \text{Generality} = \frac{(\text{total relevant in collection}) \,/\, (\text{number of queries})}{\text{total number of documents in collection}} $$




The generality for Cran 200 with respect to this new query 
set is 26.14 and for the Cran 400 is 12.30. 
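
The generality figure can be checked with a few lines of arithmetic. The sketch below assumes that the reported values are expressed per thousand documents, which is an inference from the numbers (115 relevant documents, 22 queries, 200 documents) and is not stated in the text; the function name is illustrative.

```python
def generality(total_relevant, num_queries, num_documents):
    """Average number of relevant documents per query, divided by collection size."""
    return (total_relevant / num_queries) / num_documents


if __name__ == "__main__":
    g200 = generality(115, 22, 200)
    # Scaled by 1000 (an assumption) to compare with the figure quoted for Cran 200.
    print(round(g200 * 1000, 2))
```

Under the same assumption, the figure quoted for the Cran 400 implies a collection somewhat larger than 400 documents; the exact size used is not given in this section.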

The query-update formula used for relevance feedback
is [2]:

$$ Q_{i+1} = \pi Q_i + \omega Q_0 + \alpha \sum_{i=1}^{\min(n_a,\, n'_r)} r_i + \mu \sum_{i=1}^{\min(n_b,\, n'_s)} s_i $$

where

$\pi, \omega, \alpha, \mu$   are multipliers
$Q_{i+1}$                    is updated query
$Q_i$                        is previous query
$Q_0$                        is original query
$n'_r$                       is number of relevant documents retrieved
$r_i$                        is relevant document retrieved
$n'_s$                       is number of nonrelevant documents retrieved
$s_i$                        is nonrelevant document retrieved
$n_a, n_b$                   specify the number of documents to be used.
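
A minimal sketch of this update, assuming that queries and documents are sparse vectors stored as concept-to-weight dictionaries and that the retrieved documents are supplied in rank order. The default multipliers and the step that drops non-positive weights are illustrative assumptions, not the SMART implementation.

```python
from collections import defaultdict


def update_query(q_prev, q_orig, relevant, nonrelevant,
                 pi=1.0, omega=0.0, alpha=1.0, mu=-1.0, n_a=5, n_b=5):
    """Return the updated query as a dict of concept -> weight.

    relevant / nonrelevant: lists of document vectors in rank order.
    Slicing with n_a and n_b realizes min(n_a, n'_r) and min(n_b, n'_s).
    """
    new_q = defaultdict(float)
    for concept, weight in q_prev.items():
        new_q[concept] += pi * weight
    for concept, weight in q_orig.items():
        new_q[concept] += omega * weight
    for doc in relevant[:n_a]:
        for concept, weight in doc.items():
            new_q[concept] += alpha * weight
    for doc in nonrelevant[:n_b]:
        for concept, weight in doc.items():
            new_q[concept] += mu * weight  # mu is negative for negative feedback
    # Assumption: concepts whose weight is no longer positive are dropped.
    return {c: w for c, w in new_q.items() if w > 0}


if __name__ == "__main__":
    q0 = {"a": 1, "b": 1}
    rel = [{"b": 1, "c": 2}]
    nonrel = [{"a": 1, "d": 1}]
    print(update_query(q0, q0, rel, nonrel))
```

In the example, concept "a" is cancelled by the nonrelevant document, which shows on a small scale how heavy negative feedback can remove concepts from a query.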



Various strategies can be formulated using the above
equation with the added parameters in the SEARCH routine of the
SMART system, such as ALLOF, ATLEST and NOMOR. ALLOF is the
number of documents to be retrieved. ATLEST is the minimum
number of documents to be used in feedback and NOMOR is the 
maximum number of documents to be searched to provide docu- 
ments for feedback. Only one iteration of feedback is used in 
this study because the most noticeable effect of feedback results 
from this iteration [5]. A frozen feedback iteration is used 
to eliminate the ranking effect for evaluation purposes. 

Since the purpose of the experiment is to study the 
overall effect of feedback on these collections, a wide range 
of strategies are chosen: 

Strategy 1 is positive feedback

Strategy 2 is the "dec hi" strategy [2]

Strategy 4 is a modified "dec hi" strategy which uses
a nonrelevant document for feedback only when
no relevant documents are retrieved

Strategies 3 and 5 use varied multipliers.

The actual parameters are shown in Table 1. 

This study attempts to determine whether feedback im- 
proves retrieval in one collection more than the other. That 
is, the initial full search results serve only as a base line 
and the improvement after using feedback is the result to be 
measured. Consequently, the following performance measures 
are stressed: 

a) Precision improvement: $P_1 - P_0$

This indicates whether a particular strategy is
better for one collection than the other from a
user viewpoint.

Strategies
1 2 3 4 5

2 
5 
1 
5 
1 
0 
5 

Parameters for the Feedback Strategies 
Table 1 



Original Multiplier      ω
Positive Rank Cut        n_a (ALLOF)
Positive Multiplier      α
Negative Rank Cut        n_b ($ALLOF)
Negative Multiplier      μ
Negative At Least        $ATLST
Negative No More         $NOMOR



1 

10 

1 



1 

5 

1 

1 

-1 

1 

10 



1 1 

5 5 

4 1 

5 1 

-1 -1 

0 1 

5 5 





$P_0$ is precision of initial search
$P_1$ is precision of feedback iteration
These are taken at fixed recall points. 



b) Percentage of precision improvement: $\frac{P_1 - P_0}{P_0} \times 100\%$



This takes into account the fact that precision is 
better for a collection with a higher generality 
number [4] by taking the difference with respect 
to the original precision. 



c) Fallout improvement: $F_0 - F_1$

A performance improvement implies that fallout for
the feedback iteration is less than fallout for the
initial search. This expression is equivalent to
$(-1) \times (F_1 - F_0)$, and multiplying by $-1$ serves
to transform the difference onto the positive scale.

$F_0$ is fallout of initial search
$F_1$ is fallout of feedback iteration
These are taken at fixed recall points. 



d) Percentage of fallout improvement: $\frac{F_0 - F_1}{F_0} \times 100\%$

This takes into account the fact that the fallout is
not the same for the initial searches on both
collections. Therefore, the difference is computed
as a percentage of the original.



e) Adjusted precision [4]:

$$ P_A = \frac{R_1 \times G_2}{(R_1 \times G_2) + F_1 (1 - G_2)} $$

Precision of the Cran 200 is adjusted to that of
Cran 400 and not vice versa, because the emphasis
of this study is on performance of larger collections.

$R_1$ is fixed recall points
$F_1$ is fallout of Cran 200 at recall $R_1$
$G_2$ is generality of Cran 400

In this manner, the results from two collections of
different generality can be compared on an equal
basis. This comparison is from a system viewpoint.

f) Adjusted precision improvement: $P_{A_1} - P_{A_0}$

Similar to a).

g) Percentage of adjusted precision improvement: $\frac{P_{A_1} - P_{A_0}}{P_{A_0}} \times 100\%$

Similar to b).
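
The measures a) through g) reduce to two small computations, sketched below. The function names are illustrative; generality is assumed to be given as a fraction between 0 and 1, and the adjusted precision follows from writing precision in terms of recall, fallout and generality.

```python
def improvement(before, after, lower_is_better=False):
    """Return (absolute improvement, percentage improvement).

    For precision, a larger value is better: after - before.
    For fallout, a smaller value is better: before - after.
    """
    diff = (before - after) if lower_is_better else (after - before)
    return diff, 100.0 * diff / before


def adjusted_precision(recall, fallout, generality):
    """Precision that would be observed at the given recall and fallout
    in a collection with the given generality (a fraction)."""
    rg = recall * generality
    return rg / (rg + fallout * (1.0 - generality))


if __name__ == "__main__":
    # Hypothetical values, chosen only to show the calls.
    print(improvement(0.20, 0.25))                        # precision
    print(improvement(0.08, 0.06, lower_is_better=True))  # fallout
    # 100 documents, 10 relevant, 5 relevant and 15 nonrelevant retrieved:
    print(adjusted_precision(0.5, 15 / 90, 0.1))
```

When the generality passed in is the collection's own, the adjusted precision equals the ordinary precision (5 relevant out of 20 retrieved in the last call), so adjusting only changes the result when the generality of another collection is substituted.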



3. Experimental Results 

The results seem to fall into two categories with strategies
2, 3 and 5 in one group (group A) and strategies 1 and 4
in the other group (group B). The former group consistently
shows a good performance for Cran 200, but there is little 
improvement for Cran 400, whereas the latter group shows an 
equivalent improvement. The average improvements for one stra- 
tegy from each group are shown in Table 2. 

In group A from both a system and a user viewpoint, the 
Cran 200 performs better as can be seen in all the improvement 
graphs. In fact, for strategy 3, Cran 400 performs worse using 
feedback than for the full search as shown by the negative values 
in the precision improvement curve (Fig. la) and the percentage 
fallout improvement curve (Fig. 2b). This result seems to indi- 
cate that this class of feedback strategies will not perform 
well in a library situation. For strategies 3 and 5, the result
is probably due to the large number of nonrelevant documents
used for feedback which tend to eliminate the query. There
is a median of four nonrelevant documents used for feedback
on the Cran 400 and of three documents on the Cran 200. In
strategy 5 on the Cran 400, out of the twenty-two queries, seven
queries have two or fewer concepts left after feedback whereas
on the Cran 200 there are only four such queries. The larger
multiplier for the original query in strategy 3 partially
offsets this effect of erasing the query.

                                  Strategy 1              Strategy 3
                              Cran 200   Cran 400     Cran 200   Cran 400

Precision improvement           .0293      .0307        .0526      .0218
% of Precision improvement      12.50      16.99        19.38      12.73
Fallout improvement             .0104   (illegible)     .0130      .0066
% of Fallout improvement        13.38      15.94        21.73      12.16
Adj Precision improvement       .0194                   .0379
% Adj Precision improvement     24.21                   23.79

Average Improvement Results for Strategies 1 & 3

Table 2

As for strategy 2 which always uses one nonrelevant docu- 
ment for feedback, the Cran 200 precision improves while the 
Cran 400 precision does not (Fig. 3a). This is due to the fact 
that on the Cran 200 there would be more relevant documents
retrieved (median of 2); therefore, one nonrelevant document
does not erase the query. On the Cran 400, however, fewer
relevant documents would be retrieved (median of 1); therefore,
one nonrelevant document might remove more concepts than are 
added by the relevant documents in the feedback. 

Looking at the precision improvement graphs for group 
B, Cran 200 and Cran 400 curves using strategy 1 (Fig. 4a) are 
interspersed whereas for strategy 4, the Cran 200 curve is 
usually higher (Fig. 5a). But looking at the percentage pre- 
cision graphs, for strategy 1 (Fig. 4b), the Cran 400 is better 
at all recall points. This is not unexpected, since the original
precision of the Cran 400 is lower than that of the
Cran 200. Therefore even with a similar increase in precision,
from a system viewpoint, the feedback is more helpful in improving
retrieval for the Cran 400 since this larger collection
is not as favorable to retrieval as a smaller collection in
the first place (lower original precision). For strategy 4 
(Fig. 5b), the two curves are interspersed instead of the 
Cran 400 being lower because once the original precision is 
taken into account, the percentage increase becomes similar. 

Theoretically, the fallout curves (see Appendix) for the 
two collections should be the same. However, there is probably 
a subset in the Cran 400 collection of nonrelevant documents 
which have a very low probability of being retrieved [4].

This explains why fallout for the Cran 400 seems better, a fact 
to be remembered when comparing fallout values. 

For strategy 1, in the fallout improvement graph
(Fig. 6a), Cran 200 is for the most part better. On the corresponding
percentage fallout improvement graph (Fig. 6b) the
Cran 400 is slightly better. For strategy 4, on the other
hand, the difference in fallout improvement is more pronounced
and the percentage fallout improvement is more similar (Fig.
7a, 7b).

These fallout results are quite logical. Since on the 
feedback run, the number of relevant documents retrieved on the 
Cran 200 tends to be larger than for the Cran 400 (usually one 
more relevant document for Cran 200), the number of nonrelevant 
documents would be smaller. Therefore, the fallout improvement 
for the Cran 200 is larger. However, when the original fallout 
values are considered, the two collections become similar. 

Once precision for the Cran 200 is adjusted to that of
the Cran 400, the recall-precision curve for the Cran 200
is lower than that for the Cran 400 (see Appendix). Therefore,
according to these graphs, from a system viewpoint, Cran 400
definitely shows better performance. From the adjusted precision
improvement graphs (Fig. 8a, 9a), the improvement of
Cran 400 is at least equal if not more than that of Cran 200.
This result is also supported by the percentage adjusted precision
improvement graphs (Fig. 8b, 9b). From both a user and

a system viewpoint, it would appear that use of these feedback 
strategies is at least as effective for a larger collection 
(lower generality number). 

An interesting comparison can be made between strategies 
2 and 4 since they are similar in that both use negative feed- 
back of one nonrelevant document. However, the fact that 
strategy 4 uses negative feedback only when no positive feedback 
can be performed, as opposed to strategy 2 which uses it for 
all queries, causes strategy 4 to be effective and strategy 2 
to fail on the Cran 400. For strategy 4, the few relevant documents
used in feedback are not offset by any negative feedback
as they would be for strategy 2 (see discussion of strategy
2 above).

4. Conclusion 

Results of this study are encouraging in that they seem 
to indicate that some feedback strategies can indeed be used in 
a realistic environment. Those commonly used strategies such as 
pure positive feedback and the strategy which uses the top 
ranking nonrelevant document only when no relevant documents are 
retrieved, are equally effective on the Cran 200 and the Cran
400.

It is generally believed that feedback on a collection 
of lower generality will not be as effective and that feedback 
on a collection as large as a library is not promising. However, 
the results of this study do seem to indicate that relevance
feedback would be operative on a library collection, contrary 
to common belief. Of course this is highly dependent on which 
feedback method is used, since some strategies (such as those 
using a large number of nonrelevant documents) perform poorly 
on collections of lower generality. Furthermore, as the fall- 
out curves indicate, the Cran 400 collection might have a dis- 
joint subset of documents never retrieved. Thus generality 
should be recomputed by removing such documents. In addition, 
the test collections used here are limited in that they pertain 
to only one subject area. 

A suggestion for future experiments is that queries 
should be examined individually to isolate irregular behavior. 
Also a larger query collection and document collection on more 
than one subject area would be advisable to substantiate the 
results. Based on the findings of this study, variations of 
the two feedback strategies in group B — e.g. requiring a 
constant number of relevant documents to be fed back or using 
different rank cut values — should be explored. 
X. Selective Negative Feedback Methods
M. Kerchner 



Abstract 

A great deal of work has already been done in automatic information 
retrieval in an effort to improve performance and to satisfy user needs. 

In particular various techniques have been described which modify the 
initial query submitted by the user, including the use of nonrelevant and 
relevant retrieved documents. The present study deals with experiments 
performed with several new methods of using nonrelevant retrieved documents 
to modify queries which retrieve no relevant in the first N documents 
retrieved. The results of the experiments are evaluated and suggestions 
are made for possible further investigations. 

1. Introduction 

Relevance feedback is a technique for improving the performance of
an information retrieval system to better satisfy the needs of its users. [1]
A search of the document collection is made with an initial query and
a set of retrieved documents, ranked in order of correlation with the query,
is presented to the user. After examining the set of retrieved documents,
the user indicates whether each is relevant or not relevant to his query. [3]
The relevance judgments are used by the system to modify the original
search query in such a way that the modified query will retrieve additional 
relevant documents. 

Experiments have been made with several methods of positive relevance
feedback in which highly ranked relevant documents are used to modify
the query. [2,4] In the case where no relevant documents are retrieved in
the first N documents considered, negative relevance feedback (the use
of nonrelevant documents for query modification) has been the basis for
experimentation. [1,2,5] However, some problems arise with the use of
nonrelevant documents for query modification. Riddle et al. [4] and Ide [5]
confirm that in some cases the use of nonrelevant documents perturbs the query
vector so grossly that no additional relevant documents are retrieved in
subsequent searches with the modified query. [6]

In the present study, the SMART document retrieval system is used 
as the basis for experiments on methods which propose to deal with the above 
and related problems. 

2, Methodology 

It has been shown by previous work that methods using positive relevance
feedback are reasonably successful for queries retrieving at least
one relevant document in the first N retrieved. Therefore, the experiments 
in this study are only concerned with those queries which retrieve no rele- 
vant in the first N (N=5) documents retrieved. 

To deal with the problem of overdistortion of the query which occurs
with standard negative feedback schemes in which highly ranked nonrelevant
documents are subtracted from the query, Johnson and Krablin [6] propose that
more selective methods be used in order to "insure the integrity of the original
relevant concepts in the query" and to move the query out of an area of
nonrelevant concepts in the document space by using a series of selected
terms for negative feedback. The approach suggested by Johnson and Krablin
is to select those terms which appear in several of the highly correlated
nonrelevant documents, but not in the original query, and to add these terms,
with negative weights, to the query.

In connection with this approach, it is important to note that a 
large portion of normal queries covers more than one subject area. [7] 

In addition, concepts which appear in highly correlated nonrelevant documents
may also be significant in retrieved relevant documents. As a result, since
the basic selective negative feedback strategy of Johnson and Krablin
leaves untouched those concepts in the query which may have been found
in several of the highly correlated nonrelevant documents (and, as noted,
several of the relevant retrieved as well), the query appears to remain in
approximately the same area of the document space, as seen in Fig. 1. The
highly correlated nonrelevant documents in the area may no longer be
retrieved but the query also does not approach the documents relating to
any secondary relevant subject area. The retrieval results confirm that 
most of the improvement obtained is caused by raising the ranks of the 
relevant documents in the primary subject area, and, in some cases, re- 
trieving several other relevant in the same part of the document space. 

In contrast, by removing those concepts in the query which are 
shown to be significant in the highly ranked nonrelevant documents, the query 
is moved from that part of the document space in which those documents 
appear, i.e. from an area of the space which is, in a sense, "more" non- 
relevant than relevant to the query. It is hypothesized, as shown in 
Fig. 1, that the query is moved nearer to the set of documents related to 
its second subject area since, presumably, the concepts which remain in the
query relate to this area and, by removing the other concepts (or decreasing
their weights), the remaining (or more weighty) concepts now assume primary
importance in the query. In fact, a situation analogous to query splitting
is achieved, although relevant documents in the original area of the document
space may now be overlooked. However, while missing these relevant
documents, experiments show that the query is moved significantly nearer the
second subject area and more new documents in this area are retrieved than
would be the case if additional documents in the first subject area were
retrieved by not modifying those selected concepts which appear in the
query.

[Figure, three panels: a) Typical SMART Retrieval; b) Typical SMART Retrieval
with relevance feedback to modify query; c) Typical retrieval with query
modified by selective negative feedback (Methods 1, 4)]

Selective Negative Feedback Illustration

Fig. 1



3. Selective Negative Relevance Feedback Strategies 

The following procedure is used in testing the various selective 
negative feedback methods to be described. 

1. A full search is made with the original queries. (Note: As
   mentioned above, only those queries which retrieve no relevant
   in the first 5 documents retrieved are used in this study.)

2. Modify the query in one of the following ways (as summarized
   in Table 1):

   Method 1: Any concept which appears in at least
             3 of the first 5 nonrelevant documents is
             deleted if it appears in the query. No
             new concepts are added to the query.

   Method 2: Any concept which appears in at least
             3 of the first 5 nonrelevant documents is
             assigned a weight equal to the average of its
             weights in these documents multiplied by -1.
             If the concept appears in the query, its
             weight is replaced by the new calculated
             weight. If the concept does not appear in the
             query, it is added to the query.

   Method 3: This method is similar to Method 2 but if
             the selected concept appears in the query,
             the new (negative) weight of the concept is
             added to its present weight in the query.

   Method 4: This method is similar to Method 1 but a
             concept must appear in all 5 nonrelevant
             documents in order to be selected.

   Method 5: This technique is similar to Method 2 but
             a concept must appear in all 5 nonrelevant
             documents to be selected.

3. Search the document collection with the modified query,
   and repeat the procedure of part 2.

This process is halted when a satisfactory proportion of relevant
documents are retrieved.
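
The query modification of step 2 can be sketched as follows, assuming that the query and the nonrelevant documents are concept-to-weight dictionaries. Two points are assumptions of this sketch rather than statements of the text: the average weight is taken over the documents that contain the concept, and Method 5 is treated as the "all 5 documents" variant of Method 2.

```python
from collections import Counter


def select_concepts(nonrel_docs, min_docs):
    """Concepts that occur in at least min_docs of the nonrelevant documents."""
    counts = Counter(c for doc in nonrel_docs for c in doc)
    return {c for c, n in counts.items() if n >= min_docs}


def modify_query(query, nonrel_docs, method):
    """Apply one of Methods 1-5 to a query; returns a new dict."""
    min_docs = 3 if method in (1, 2, 3) else 5
    new_q = dict(query)
    for concept in select_concepts(nonrel_docs, min_docs):
        if method in (1, 4):
            # Deletion only; no new concepts are added.
            new_q.pop(concept, None)
            continue
        weights = [doc[concept] for doc in nonrel_docs if concept in doc]
        negative = -sum(weights) / len(weights)
        if method == 3 and concept in new_q:
            new_q[concept] += negative   # add to the present weight
        else:
            new_q[concept] = negative    # replace the weight, or add the concept
    return new_q


if __name__ == "__main__":
    docs = [{"x": 2, "y": 1}, {"x": 2, "y": 2}, {"x": 2, "y": 3},
            {"x": 2, "z": 1}, {"x": 2}]
    query = {"x": 4, "y": 1, "q": 1}
    for m in range(1, 6):
        print(m, modify_query(query, docs, m))
```

Methods 1 and 4 can leave a query empty when every query concept is selected, which corresponds to the zero-vector danger discussed in the text.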

For comparison, searches are made with the test queries using a
standard method of negative relevance feedback in which the nonrelevant
document retrieved with rank 1 in the original search is subtracted from
the query and a subsequent search is made with the modified query. Two
feedback iterations are performed.

In Methods 1 and 4, the danger exists of reducing the query to the
zero vector. It has been found that such reduction occurs after the second 
iteration of Method 1 with only 2 queries. However, the experiments per- 
formed indicate that two iterations are the maximum number desirable, as 
further iterations cause too much distortion in the query. 



4. The Experimental Environment

The strategies outlined above have been tested on the Cranfield
collection of 424 document vector abstracts produced using a word form
thesaurus and 155 queries, 35 of which retrieve no relevant in the first five
documents retrieved. These queries are used as the experimental base.
In the experiment, 15 documents are shown to the user but only the first
five are used for relevance feedback.

Method 1:  Any concept which appears in at least 3 of the 5
           nonrelevant documents is deleted if it appears in
           the query. No new concepts are added to the query.

Method 2:  Any concept which appears in at least 3 of the 5
           nonrelevant documents is assigned a weight equal to
           the average of its weight in the 5 documents multiplied
           by -1. If the concept appears in the query,
           its weight is replaced by the calculated weight.
           If the concept does not appear in the query, it is
           added to the query.

Method 3:  This method is similar to Method 2 but if the
           selected concept appears in the query, the calculated
           (negative) weight of the concept is added to
           its present weight in the query.

Method 4:  This method is similar to Method 1 but a concept
           must appear in all of the 5 nonrelevant documents
           in order to be selected.

Method 5:  This method is similar to Method 3 but a concept
           must appear in all 5 of the nonrelevant documents
           to be selected.

Five Proposed Selective Negative Feedback Schemes

Table 1

5. Experimental Results 

Since it is hypothesized that modification of the query by the proposed
methods moves it to a part of the document space which represents the
second subject area, it is important to consider the number of new relevant
documents which are retrieved in the first 15 documents, i.e. those which
have not previously been shown to the user. [7,8] As seen in Table 2,
Method 1 is the most successful in retrieving new relevant documents. In one
iteration 24 relevant documents appear in the first 15 documents retrieved
or 15.5% of the remaining relevant documents, with an average of 3.0 concepts
deleted from each query. In two iterations, a total of 30 new relevant
documents are shown to the user or 19.4% of the remaining relevant in
the collection for this particular set of queries. Method 4, which requires
that a concept appear in all 5 nonrelevant documents in order to be deleted,
retrieves 16 new documents or 10.3% of the remaining relevant, with an
average of 1.6 concepts deleted from each query. The techniques which add
concepts with negative weights to the query show inferior results. Method 2
retrieves only 9 new documents or 5.8% of the remaining relevant while
Method 3 retrieves 8 new relevant documents. Thus it appears that assigning
a weight of zero to a concept, i.e., deleting it from the query, results
in less distortion of the query than assigning it a negative weight. In
addition, Methods 1 and 4, which both neglect to add new concepts with negative
weights to the query, are significantly more successful in retrieving
new relevant documents than Methods 2, 3, and 5, which do add new concepts
with negative weights.

                                                 Method 1   Method 1    Method 2  Method 3     Method 4  Method 5
                                                 (1 iter.)  (2 iters.)

Number of queries modified                          34         34          34        34           22        22
Number of relevant in first 5 retrieved             13         14           8         9           12        10
Number of relevant in first 15 retrieved            38         28          10        10           33        21
Number of new relevant in first 5 retrieved         13         16           8         9           11         9
Number of new relevant in first 15 retrieved        24         30           9         9           16        13
% of remaining relevant retrieved in first 15      15.5       19.4         5.8   (illegible)     10.3       8.4
Number of queries which retrieve at least
  1 new relevant in the first 15                    1?         24           9         9           13        11

Comparison of Methods 1-5

Table 2

As seen in Fig. 2, the more selective modification technique of 
Method 4 results in higher precision figures at recall levels up to 0.5 
than those achieved by Method 1, although precision figures for Method 1 
are higher at the higher recall levels. It is also seen by examination of 
retrieval results that in some cases for Method 1, the ranks of relevant 
documents which are retrieved among the top 15 documents in the original 
search decrease significantly since, as hypothesized, the query is moving 
in a direction away from these highly correlated documents. As shown in 
Table 2, for Method 1, 24 of the 38 relevant documents retrieved, or 63%, 
are new relevant documents. Since Method 4 leaves 13 queries unchanged, 
the high ranks of these relevant documents remain the same and thus help 
in achieving high precision figures for Method 4 at low recall levels. 

In the same way, Method 1 tends to push low ranking relevant documents 
lower, if these documents are in the area of the document space from which 
the query is being moved, as they tend to be. In fact, using Method 1, 47 
relevant documents which have a nonzero correlation with the queries are 
reduced to having a zero correlation with the modified queries after one 
iteration. It is to be noted that some of these relevant documents have 
been seen by the user, as they appear in the top 15 retrieved documents, but, 
nonetheless, such factors affect the precision and recall calculations. 

As seen in Table 3, the standard feedback technique of subtracting
the nonrelevant document with rank 1 from the query retrieves only 13 new
relevant documents after 2 iterations, or 8.4% of the remaining relevant.
The high average number of concepts subtracted from the query after two
iterations, 92.1, may explain the poor performance, as the query is
probably overperturbed.

[Fig. 2: Recall-Precision Curve for Original Queries, Methods 1 and 4.
Legend entries recoverable from the source: Method 1 (1 iter.), Method 4.]

[Fig. 3: Recall-Precision Curve for Methods 2, 3 and 5.]

                                    Iteration 1   Iteration 2   Combined

Number of relevant in first 5
  retrieved                              3             3            4
Number of relevant in first 15
  retrieved                              9            12           17
Number of new relevant in first
  5 retrieved                            3             1            4
Number of new relevant in first
  15 retrieved                           8             5           13
% of remaining relevant retrieved
  in first 15                           5.2           3.2          8.4
Number of queries which retrieve
  at least 1 new relevant in the
  first 15                               5             4            9
Average number of concepts
  subtracted from the query            56.2          35.9         92.1

Results for Nonselective Negative Feedback Scheme

Table 3
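The overperturbation described above can be illustrated with a minimal sketch of nonselective negative feedback. The sketch assumes that queries and documents are sparse concept-to-weight mappings and that negative weights are kept in the modified query; the function name `subtract_document` and the sample vectors are hypothetical and are not taken from the SMART implementation.

```python
def subtract_document(query, nonrelevant_doc):
    """Nonselective negative feedback sketch: q' = q - d for one nonrelevant document d."""
    modified = dict(query)
    for concept, weight in nonrelevant_doc.items():
        # Every concept of the document perturbs the query, including
        # concepts the query never contained (they become negative).
        modified[concept] = modified.get(concept, 0.0) - weight
    return modified


# Hypothetical sample vectors, for illustration only.
query = {"retrieval": 2.0, "feedback": 1.0}
top_nonrelevant = {"feedback": 1.0, "library": 3.0, "catalog": 2.0}

modified = subtract_document(query, top_nonrelevant)
changed = [c for c in modified if modified[c] != query.get(c, 0.0)]
print(modified)
print(len(changed), "concepts changed")
```

Under these assumptions every concept of the subtracted document alters the query, which is one plausible reading of the large averages (56.2 and 35.9 concepts per iteration) reported in Table 3.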



6. Evaluation of Experimental Results 

As the criteria cited above (number of new relevant retrieved, 
etc.) as well as the statistical T- and Wilcoxon Signed Rank tests favor 
Methods 1 and 4 significantly over Methods 2, 3, and 5, only the former are 
compared with the standard nonselective negative feedback scheme and with 
each other. 

According to the T-test, the differences in performance between 
Method 1 and Method 4 are statistically significant. Using measures of 
rank recall, log precision, normalized recall, normalized precision, and 
recall level averages, Method 4 is concluded to be "better" than Method 
1. The Wilcoxon Signed Rank test confirms this conclusion. 

The Sign test favors the nonselective negative feedback strategy 
over Method 1 while the same test favors Method 4 over nonselective nega- 
tive feedback. However, as noted above and by others, [7,8] several 
other factors must be considered in evaluating the various strategies. 

Methods 1 and 4 both perform better than the nonselective negative
feedback scheme as reflected by the number of new relevant retrieved.
This is also reflected in the standard precision-recall curves (see Tables
2 and 3, Figs. 1 and 3). As noted previously, the improved precision-recall
curves for these methods do not result from simply raising the ranks of
already retrieved relevant documents for, as shown in Table 2, 63% of the
relevant documents retrieved by Method 1 are new documents not seen before
by the user. For Method 4, 48% of the relevant documents retrieved are
new.



To determine which of Methods 1 or 4 is to be favored, it must be 
considered that although the precision-recall curve of Method 4 is higher 
than that of Method 1 at recall levels up to 0.5, the curve for Method 1 
shows higher precision at recall levels greater than 0.5, since more relevant 
are retrieved using Method 1 than if Method 4 is used. At low recall levels, 
precision may be improved by raising the ranks of relevant documents already 
shown to the user. As noted by Hall et al. [7] and Cirillo et al. [8], assuming 
that 15 documents are shown to the user, whether a relevant document is 
ranked 8 or 13 is not important to the user since he is shown both documents; 
it is in the higher ranks of relevant documents retrieved that Method 4 seems 
to show better performance figures than Method 1. 

It is, in addition, important to note that Method 4, due to its
more selective modification procedure which requires that a concept appear
in all 5 nonrelevant documents in order to be deleted from the query, fails
to alter 13 of the 35 queries while Method 1 modifies 34 of the 35 queries.
For those queries which are modified, the two methods perform similarly in
terms of the number of new relevant documents retrieved. Method 1 retrieves
an average of 0.71 new documents per query and Method 4 retrieves an average
of 0.73 new documents per query.
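The percentages and per-query averages quoted in this section can be checked from the counts in Table 2. The short sketch below does so. The figure of 155 remaining relevant documents is not stated in the text; it is an assumption inferred from the statement that 24 new relevant documents correspond to 15.5%, and the variable names are hypothetical.

```python
# Counts taken from Table 2 and the surrounding text.
new_relevant = {"Method 1": 24, "Method 4": 16}      # new relevant in first 15
relevant_top15 = {"Method 1": 38, "Method 4": 33}    # all relevant in first 15
queries_modified = {"Method 1": 34, "Method 4": 22}
total_queries = 35

# ASSUMPTION: 155 remaining relevant documents, inferred from 24 -> 15.5%.
remaining_relevant = 155


def pct(part, whole):
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Share of retrieved relevant documents that are new to the user.
share_new = {m: round(100 * new_relevant[m] / relevant_top15[m]) for m in new_relevant}

# Average number of new relevant documents per modified query.
per_query = {m: round(new_relevant[m] / queries_modified[m], 2) for m in new_relevant}

# Share of queries left unmodified by Method 4.
unmodified_share = round(100 * (total_queries - queries_modified["Method 4"]) / total_queries)

print(share_new, per_query, unmodified_share)
print(pct(24, remaining_relevant), pct(30, remaining_relevant), pct(16, remaining_relevant))
```

With the inferred denominator, the same calculation also reproduces the 5.8% and 8.4% entries of Tables 2 and 3, which lends some support to the assumption.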

Since negative feedback schemes are conceived for the purpose of
dealing with problem queries, i.e., those which retrieve no relevant in the
first 5 documents retrieved, and thus cannot be modified by positive
feedback schemes employing relevant documents, a strategy which leaves 37% of
the queries unmodified must be considered unsatisfactory for the purpose
for which it is designed.
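The selective deletion rule discussed in this section can be sketched as follows, for the two thresholds that appear in the text (a concept must occur in all 5, or in at least 3, of the 5 nonrelevant documents). The sketch assumes sparse concept-to-weight mappings; the function name `delete_common_concepts` and the sample data are hypothetical, and the actual SMART routines may differ.

```python
from collections import Counter


def delete_common_concepts(query, nonrelevant_docs, min_docs):
    """Return a copy of `query` without concepts found in >= min_docs nonrelevant documents.

    Deleting a concept is equivalent to assigning it a weight of zero;
    no concept receives a negative weight.
    """
    doc_freq = Counter()
    for doc in nonrelevant_docs:
        doc_freq.update(set(doc))  # count each concept once per document
    return {c: w for c, w in query.items() if doc_freq[c] < min_docs}


# Hypothetical query and top-5 nonrelevant documents, for illustration only.
query = {"a": 2.0, "b": 1.0, "c": 3.0}
nonrelevant = [
    {"a": 1.0, "b": 1.0},
    {"a": 1.0, "b": 2.0},
    {"a": 3.0, "b": 1.0, "c": 1.0},
    {"a": 1.0},
    {"a": 2.0, "d": 4.0},
]

print(delete_common_concepts(query, nonrelevant, min_docs=3))  # looser threshold
print(delete_common_concepts(query, nonrelevant, min_docs=5))  # all-5 threshold
```

In this toy case the all-5 threshold removes only concept "a", while the at-least-3 threshold also removes "b", which mirrors the observation that the stricter rule alters fewer queries and deletes fewer concepts per query.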

Therefore, it is recommended that Method 1, which deletes from the
query those concepts which appear in at least 3 of the 5 nonrelevant
documents, be used as a negative feedback scheme for those queries which
retrieve no relevant documents in the first 5 retrieved. However, as it is
hypothesized in the present study that the large number of new relevant
documents retrieved by queries modified by this strategy are obtained by
moving the query to a new section of the document space, which represents
its second subject area, it is necessary to perform further experiments to
determine how to retrieve the relevant which remain unretrieved in that part
of the document space which relates to its first subject area. A
combination of such techniques would presumably result in significantly better
retrieval results for the problem queries dealt with in this study.

XI. The Use of Past Relevance Decisions 
in Relevance Feedback 

L. Paavola 



Abstract 

A high degree of similarity may be expected to exist among
documents judged to be relevant to the same query. This paper
investigates some possibilities for exploiting this potential
similarity in relevance feedback. Runs are made on the ADI and
Cranfield 424 collections of the SMART retrieval system. In
these runs all "jointly relevant" documents are incorporated
into feedback as if they were a single relevant document.
Standard recall-precision evaluation measures are used, and the
performance of some individual queries is illustrated. Some
directions for further research are suggested.

1. Introduction 

In the SMART system, statistical and syntactic analyses
of search queries and documents are used for text analysis, and
automatic comparisons of analyzed queries to documents or to
sets of centroids of document clusters are used for the
selection of documents to be displayed to query authors. [1] However,
the utility of these methods alone is severely limited, and
attempts have been made to introduce subjective judgments into
the retrieval process. The usual method, known as relevance
feedback, uses a query author's decisions about the relevance
to his query of specified documents in order to modify the vector
representation of his query. [1, section 7-4] Occasionally such
judgments are used to modify document vectors. [1] Methods which
do not alter query or document vectors include query splitting [3]
and query clustering. [4,5]

2. Assumptions and Hypotheses 

This paper details another method of using the history of
a system to improve its performance. The assumption is made that
if a given document is known to be relevant to a query, another
document is more likely to be relevant to the query if both have been
judged relevant to some past query. It is further assumed that the
number of such past occurrences of joint relevance may be a useful
index to inter-document similarity.

The following problems may be anticipated in such a system: 
the system may be handicapped in dealing with queries of a type which 
it has not encountered frequently earlier; user ideas of relevance 
and nonrelevance may differ widely; unless special measures are 
taken, documents which may be relevant to a given query but never 
initially retrieved (e.g., situations in which query splitting would 
be in order) may become increasingly less likely ever to be retrieved. 

The proposed method is expected to have the following
advantages: general queries with a high number of relevant documents may
establish a loose connection between documents of the same general
subject area, while specific queries may set up stronger connections
between more closely similar documents; the system may function well
for the "average" user, if queries do not vary too widely; groups
of documents of which all are relevant to each of several queries
may be used to better performance.



3. Experimental Method 

The procedure is first tried on the ADI collection of the 
SMART retrieval system, consisting of 82 documents and 35 queries, 
then on the Cranfield 424 collection, which has 424 documents and 
155 queries. In each case, the query collection is divided into 
two equal groups by random methods. The documents relevant to 
each query are known. From the relevance decisions for the 
queries in the first group a list is made for each document of
the other documents with which it has been included in such
decisions and the number of times for each, as shown in Fig. 1.
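The construction of these joint relevance lists can be sketched as follows. This is a minimal illustration, assuming the past relevance decisions are available as a mapping from query number to the documents judged relevant to it; the function name `build_joint_relevance_lists` and the sample decisions are hypothetical.

```python
from collections import defaultdict
from itertools import permutations


def build_joint_relevance_lists(relevance_decisions):
    """For each document, count how often every other document was judged
    relevant to the same past query."""
    joint = defaultdict(lambda: defaultdict(int))
    for relevant_docs in relevance_decisions.values():
        # Every ordered pair of distinct documents relevant to the same
        # query contributes one joint relevance decision.
        for doc, other in permutations(sorted(set(relevant_docs)), 2):
            joint[doc][other] += 1
    return joint


# Hypothetical decisions for three past queries, for illustration only.
past_decisions = {1: [6, 51, 400], 2: [51, 400], 3: [89]}
joint_lists = build_joint_relevance_lists(past_decisions)

print(dict(joint_lists[51]))
print(dict(joint_lists[400]))
```

A document that was the only relevant document for its queries (document 89 here) receives no list at all, which corresponds to the case of documents with no joint-relevant documents.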

The other half of the query collection is used to make 
three searches of the entire document collection. The first 
search is a full search using unaltered query vectors. The 
second search incorporates in positive feedback those documents 
among the first five shown the user which are judged relevant by 
him. The third search alters the query vectors in the way de- 
scribed below. 

In general, the altered query is constructed according to
the following formula:

$$q = a_0 q_0 + a_1 \left( \sum_{i=1}^{N_R} D_{R_i} \right) + a_2 \left( \sum_{i=1}^{N_{JR}} \left( a_3 n_{D_i} + a_4 n_{J_i} \right) D_{JR_i} \right)$$

where

  $q$ = the altered query

  $q_0$ = the initial query

  $D_{R_i}$ = the relevant documents among the top n (here n=5),
      according to the ranking produced by the full search

  $N_R$ = the number of such documents $D_{R_i}$

  $D_{JR_i}$ = documents joint relevant to any of the $D_{R_i}$

  $N_{JR}$ = the number of such documents $D_{JR_i}$

  $n_{D_i}$ = the number of $D_{R_i}$ to which a particular $D_{JR_i}$ has
      been found to be joint relevant

  $n_{J_i}$ = total number of joint relevancy decisions of the
      particular $D_{JR_i}$ with any of the $D_{R_i}$

  $a_0, a_1, a_2, a_3, a_4$ = adjustable parameters

(One may choose to include in feedback only those $D_{JR_i}$ which have
$n_{J_i}$ greater than a certain minimum value.)

An example of the use of this notation is given in Fig. 2.
The particular coefficients that have been tried for the
Cranfield 424 collection are $a_0 = 100$, $a_1 = 100$,
$a_2 = 100 / \left( \sum_{i=1}^{N_{JR}} n_{J_i} \right)$,
[text missing in source]
in the collection have an extremely large number of joint-relevant
documents, while others have none.
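The query alteration formula can be sketched in code as follows. This is only an illustration under stated assumptions: vectors are sparse concept-to-weight mappings, `joint` maps a document to its joint relevance counts, and the defaults a3 = 0 and a4 = 1 are assumptions (the values used in the report are lost in the source copy). The function name `altered_query` and the sample data are hypothetical.

```python
def altered_query(q0, docs, relevant_ids, joint,
                  a0=100.0, a1=100.0, a3=0.0, a4=1.0, min_nj=0):
    """Sketch of q = a0*q0 + a1*sum(D_R) + a2*sum((a3*n_D + a4*n_J) * D_JR)."""
    n_d, n_j = {}, {}
    for r in relevant_ids:
        for other, count in joint.get(r, {}).items():
            n_d[other] = n_d.get(other, 0) + 1      # how many D_R it is joint relevant to
            n_j[other] = n_j.get(other, 0) + count  # total joint relevance decisions
    # Keep only joint relevant documents whose n_J exceeds the minimum.
    selected = [d for d in n_j if n_j[d] > min_nj]
    total_nj = sum(n_j[d] for d in selected)
    # ASSUMPTION: a2 = 100 / (sum of n_J), as reconstructed from the text.
    a2 = 100.0 / total_nj if total_nj else 0.0

    q = {c: a0 * w for c, w in q0.items()}

    def add(doc_id, factor):
        for c, w in docs[doc_id].items():
            q[c] = q.get(c, 0.0) + factor * w

    for r in relevant_ids:
        add(r, a1)
    for d in selected:
        add(d, a2 * (a3 * n_d[d] + a4 * n_j[d]))
    return q


# Hypothetical data: document 6 is the only relevant document in the top 5.
q0 = {"x": 1.0}
docs = {6: {"x": 1.0, "y": 1.0}, 51: {"y": 2.0, "z": 1.0}, 400: {"z": 1.0}}
joint = {6: {51: 3, 400: 1}}

print(altered_query(q0, docs, [6], joint))
```

With a3 = 0 and a4 = 1 the joint relevant documents together carry a total weight of 100, the same as one relevant document, which is the normalization the chosen value of a_2 appears intended to achieve.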

The successive definitions of the query of Fig. 2 are 
illustrated in Fig. 3. The query used for the pure positive 
feedback search and that used for the joint relevance search 
are always identical except that in the latter certain concepts 
have increased weight and other concepts are added. 

To obtain a final evaluation, the simple feedback and 
joint relevance runs are compared to each other (and to the full 
search) by the AVERAGE and VERIFY routines. 

4. Evaluation 

The run on the ADI collection shows enough difference 
between the two methods to merit a run on the Cranfield collec- 
tion. The chi square probabilities were 0.0001 for the t-test; 
0.0483 for the sign test without ties, 1.0000 with ties; and 
0.0006 for the Wilcoxon test. 

Recall-precision curves for the Cranfield 424 collection are
displayed in Fig. 4. The higher precision at low recall for
simple positive feedback is probably due to the inability of a
vector loaded with many concepts to be very accurate in choosing
the highest-ranking documents, although performing well on the
whole. From the graph of Fig. 4, it is seen that the simple
feedback method is more advisable than the particular joint
relevance strategy tried when only the ranking of the documents
at the top is important.
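For reference, the recall and precision values underlying such a curve can be computed for a single query as sketched below. This is a generic illustration and not the SMART AVERAGE routine, whose interpolation and averaging over queries are not reproduced; the function name `recall_precision_points` and the sample ranking are hypothetical.

```python
def recall_precision_points(ranking, relevant):
    """Return (recall, precision) after each relevant document in the ranking."""
    relevant = set(relevant)
    points = []
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            # recall = relevant retrieved so far / all relevant
            # precision = relevant retrieved so far / documents retrieved so far
            points.append((hits / len(relevant), hits / rank))
    return points


# Hypothetical ranking and relevance set, for illustration only.
ranking = [3, 7, 1, 9, 4]
relevant = {7, 4, 8}
print(recall_precision_points(ranking, relevant))
```

Moving a relevant document from rank 2 to rank 1 raises precision at the lowest recall point, while adding further relevant documents lower in the list affects only the higher recall points, which is the distinction drawn between the two feedback methods above.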


Of the relevant documents which were changed in rank by
the joint relevance process, 59 obtain lower ranks and 127
receive higher ranks. The probability under the null hypothesis of
a chi square larger than that observed is 0.0000 for the t-test,
Wilcoxon test, and sign test without ties; for the sign test
using ties the probability is 1.0000. The large number of ties
can be attributed to the lack of joint relevance information to
be added into many of the queries.

Search 1

    q = q_0

Search 2

    q = q_0 + 1(6) + 1(89) + 1(400)

Search 3

    q = 100 q_0 + 100(6) + 100(89) + 100(400)

        + (100/25) (2(5) + 3(32) + 4(51) + 1(65)

                    + 1(71) + 1(93) + 2(212) + 1(284)

                    + 2(312) + 3(400))

3(400), e.g., means document 400 is added in with
weight 3.

Query Alteration
Fig. 3

[Fig. 4: Precision plotted for the Full Search, Simple Feedback, and
Joint Relevant runs.]
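The sign test without ties can be sketched as follows. This is a generic two-sided exact sign test in which tied pairs are simply dropped; it is an assumption that this corresponds to the "without ties" variant of the SMART VERIFY routine, and the treatment of ties in the "with ties" variant is not reproduced. The function name `sign_test_two_sided` is hypothetical.

```python
from math import comb


def sign_test_two_sided(n_plus, n_minus):
    """Exact two-sided sign test p-value, with tied pairs already discarded."""
    n = n_plus + n_minus
    k = min(n_plus, n_minus)
    # Probability of k or fewer successes under a fair coin, doubled for two sides.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Purely illustrative input: the 127 and 59 rank changes quoted above.
# The report does not state that these are the paired observations
# on which its own tests were run.
print(sign_test_two_sided(127, 59))
```

When most pairs are tied and the ties are counted as evidence for the null hypothesis, the probability moves toward 1, which is consistent in direction with the 1.0000 reported for the sign test using ties.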

Performance of the simple feedback and joint relevance
searches is shown for several queries in Fig. 5. Sometimes the
addition of joint relevance information does not substantially
affect the effectiveness of the query one way or the other (e.g.
query 54). Sometimes it actually moves the query away from 
relevant documents (query 26). But often it produces dramatic 
improvement (query 77). Sometimes the improvement is due to the 
direct addition of relevant documents (query 13), some of which 
would have been more effective had they had greater weight. Some- 
times very few relevant documents are added, but the important 
concepts are nevertheless amplified by inclusion of joint rele- 
vance information (query 42). Sometimes the inclusion of both 
produces improvement (query 7). Sometimes the additions dilute 
the query (query 61). 

The above analysis supports the conclusion that inclusion 
of joint relevance information, even if restricted to the weight 
of one relevant document only, produces significant improvement. 

In evaluating this experiment one must keep in mind the
differences between the experimental situation and an actual one.
The results are biased positively by the fact that in an actual
application not all the documents relevant to a given query would
be shown to a user, identified as relevant, and added to the joint
relevance lists. They are biased negatively by the fact that
relevance information obtained by a query in the second half of
the collection might often help a new query subsequently submitted
in the second half; this effect could not be taken into account in
the experimental design. A sounder though more laborious
experiment would have been to run the entire query collection against
the document collection, while updating the joint relevance lists
after each query. Still more significant results would have been
obtained had the joint relevance lists been composed of only those
documents which a user might see and identify as relevant. However,
such experiments are difficult to perform without adequate system
support.

5. Conclusions 

The assumptions of part 2 are found to be largely justifiable,
although the importance of the number of past joint relevance
decisions should be further investigated. The danger of biasing the
system toward one type of query is avoided, since the two halves
of the query collection are fairly similar. The experiments are
not extensive enough to detect isolation of documents. As expected,
loose and strong connections are established by general and specific
queries, respectively. The joint relevance procedure does take
advantage of document groups. And a partial but important answer
to the weighting problem is that greater emphasis should be placed
on the joint relevant documents, although ways must be found to
counteract the negative effects of such an increase. Perhaps this
effect may be partially counteracted as the number of queries
run through a system increases.

In a new experiment, low-weight concepts might be
eliminated from altered queries. Certainly better values for $a_0$,
$a_1$, $a_2$, $a_3$, and $a_4$ should be found. There may be possibilities
for the use of joint-relevance information in negative feedback. 
Incorporation of the best known feedback strategies into the 
joint-relevance query alteration equation should be attempted. 
Perhaps high-frequency occurrence in joint-relevance lists of 
a document already known to be relevant should lead to a higher 
weighting of such a document. 

The experimental data indicate that the use of joint
relevance information is a valuable tool in information
retrieval, that more testing of procedures for using this information
is in order, and that the nature of the tradeoff between
computational complexity and effectiveness of additional information
must be determined for such procedures.